Aligning, Annotating and Lemmatizing a Corpus for the Validation of Balkan Wordnets

Authors

Harry Kornilakis

Maria Grigoriadou

Eleni Galiotou

Evangelos Papakitsos

Abstract

In this paper we discuss the usage of corpora in the validation of WordNets and we present the exploitation of the Greek version of George Orwell's Nineteen Eighty-Four for the construction and validation of the Greek WordNet, which is currently under development in the framework of the BalkaNet project. In particular, we focus on the description of tools that were developed and used for the alignment, the annotation and the lemmatization of the corpus.


Related resources

Facilitating Multi-Lingual Sense Annotation: Human Mediated Lemmatizer

Sense-marked corpora are essential for supervised word sense disambiguation (WSD). The marked sense ids come from wordnets. However, words in corpora appear in morphed forms, while wordnets store lemmas. This situation calls for accurate lemmatizers. The lemma is the gateway to the wordnet. However, the problem is that for many languages, lemmatizers do not exist, and this problem is not easy to ...


Fine-Grained Word Sense Disambiguation Based on Parallel Corpora, Word Alignment, Word Clustering and Aligned Wordnets

The paper presents a method for word sense disambiguation based on parallel corpora. The method exploits recent advances in word alignment and word clustering based on automatic extraction of translation equivalents and being supported by available aligned wordnets for the languages in the corpus. The wordnets are aligned to the Princeton Wordnet, according to the principles established by Euro...


Word Sense Disambiguation as a Wordnets' Validation Method in Balkanet

BalkaNet is a European project which aims at the development of monolingual wordnets for five languages in the Balkans area (Bulgarian, Greek, Romanian, Serbian, and Turkish) and at improvement of the Czech wordnet developed in the EuroWordNet project. The wordnets are aligned to the Princeton Wordnet, according to the principles established by the EuroWordNet consortium. One of the main concerns...


The Cross-Breeding of Dictionaries

Especially for English, the number of hand-coded electronic resources available to the Natural Language Processing Community keeps growing: annotated corpora, treebanks, lexicons, wordnets, etc. Unfortunately, initial funding for such projects is much easier to obtain than the additional funding needed to enlarge or improve upon such resources. Thus once one proves the usefulness of a resource,...


PAYMA: A Tagged Corpus of Persian Named Entities

The goal in the named entity recognition task is to classify proper nouns of a piece of text into classes such as person, location, and organization. Named entity recognition is an important preprocessing step in many natural language processing tasks such as question-answering and summarization. Although many research studies have been conducted in this area in English and the state-of-the-art...




Publication date: 2003
